PIA - Reinforcement Learning

  1. RL Intro
  2. Dynamic Programing
  3. Tabular Q-Learning
  4. Deep Reinforcement Learning (DQN)

4. Deep Reinforcement Learning (DQN)

Welcome to the final Reinforcement Learning Notebook. In this part you will implement the Deep Q-Learning Algorithm that was used by Mnih et al. to play Atari Video games. The resulting agent is called Deep Q-Network agent (or shorter DQN agent) because it uses a Deep Neural Network to approximate the value function (instead of saving it in a table).


Course of Action

  • Please write all executable python code in Code-Cells (Cell->Cell Type->Code) and all Text as Markdown in Markdown-Cells
  • Describe your thinking and your decisions (where appropriate) in an extra Markdown Cell or via Python comments
  • In general: discuss all your results and comment on them (are they good/bad/unexpected, could they be improved, how?, etc.). Furthermore, visualise your data (input and output).
  • Write a short general conclusion at the end of the notebook
  • Further experiments are encouraged. However, don't forget to comment on your reasoning.
  • Use a scientific approach for all experiments (i.e. develop a hypothesis or concrete question, make observations, evaluate results)

Submission

E-Mail your complete Notebook to maucher@hdm-stuttgart.de until the start of the next lecture. One Notebook per Group is enough. Edit the teammember table below.

Important: Also attach a HTML version of your notebook (File->Download as->HTML) in addition to the .ipynb-File.

Teammember
1. Christopher Caldwell
2. Fabian Müller
3. An Dang

Prerequisites


Theory

In the last notebook you have implemented the model-free Q-Learning algorithm and solved the full reinforcement learning problem by learning from samples. In this context, full refered to the fact that we dont have acess to the world model and model-free to the fact that we have not tried to learn that model. Furthermore, Q-Learning performed online updates to the policy, meaning that we have adjusted the policy online after every time step. Finally Q-Learning is an off-policy algorithm because we followed an e-greedy behavior policy while we have performed updates according to a greedy target policy. Now we will tackle the curse of dimensionality by approximating the value function instead of saving it explicitly in a table.

Case Study - Video Games

DQN

Before we proceed to the solution, let us quickly revise the actual problem that we are trying to solve. Consider the task of learning to play a video game given only the raw game screen as input. This is similar to how humans would play the game. Since the game screen is typically represented as raw pixels, this leaves us with a really high dimensional input or state space because every change of pixels represents a new and distinct state of the game, even if the change seems completely insignificant to you! Remember, the agent has no real knowledge of the game (or world model). Clearly it is infeasible to store every possible state of pixle combinations in a table. See Reinforcement Learning: An Introduction chapter 16.5 for a comprehensive discussion.

DQN

The problem that is solved by Deep Reinforcement Learning (in the case of DQN) is how to learn a mapping from a high dimensional input space to action values. This mapping represents the value function and can be used in a policy, e.g. to choose the best action with the highest value.

Nonlinear Function Approximation with Artificial Neural Networks

First of all, a lookup table can mathematically be seen as a very simple form of a function, i.e. a direct mapping of values (hence the name value function). However, for the reasons explained above, this approach does not scale to high dimensional input spaces. A typical solution to this problem is to replace the perfect but intractable lookup table with a more complicated function that only approximates the true value function but is computationally tractable. In the case of DQN we choose a deep neural network as our function approximator. Formally, this new function is denoted as $\hat{Q}$ and we write

$$\begin{eqnarray} \hat{Q}(s,a,\theta) \approx Q_{\pi}(s,a) \end{eqnarray}$$

where $\theta$ are the parameters of the neural network. In other words, the value function now depends on those parameters and the task of finding and optimal value function turns into the task of finding an optimal set of parameters for the network. Fortunately, we know how to train and optimize a neural network with SGD and backpropagation given an appropriate loss function. Inside the RL-framework we can use the TD-error as the loss function. Formally we optimize:

$$\begin{eqnarray} L_i(\theta_{i}) = \Big( \underbrace{r_{t+1} + \gamma \max_a Q(s_{t+1}, a; \theta_i) - Q(s_t, a_t;\theta_i)}_{TD-error} \Big)^2 \end{eqnarray}$$

Note that in order to obtain any action-values, we now need to perform a forward pass through the network. In practice, this means two forward passes before we can calculate the loss, one pass for the value of $Q(s_t, a_t;\theta_i)$ and another one to calculate the value of $\max_a Q(s_{t+1}, a; \theta_i)$. More details on that later.

Instabilities and Solutions

So far, so good. By using the TD-error as loss function we can train the network in a supervised learning like setup. Sadly it is not that easy. Remember that in supervised learning we assumed the data to be independent and identically distributed (iid-data) in order for SGD to work properly. This assuption does not hold in reinforcement learning where subsequent data is highly correlated and in contrast, depends strongly on the agents last choice of actions. This inherent sequential property, in combination with an off-policy algorithm and a non-linear function appoximator such as a neural network, results in the problem that the learnable network parameters are at risk to oscillate or even diverge catastrophically during training. In theory, there is no convergence guarantee whatsoever. In practice, Mnih et al. found two major ways in which the training process can be stabilized:

  • Experience Replay - This idea introduces a so called replay buffer $\mathcal{D}$ which stores the last $N$ state transitions as experience tuples $(S,A,R,S')$. In other words, the agent saves its recent history to a buffer. This way, experience can be reused and the correlation between samples can be broken by drawing random minibatches of experience $U(\mathcal{D})$ from it during the training.

  • Fixed Q-Targets - The second idea is to keep a separate set of parameters $\theta^{−}$ for calculating the TD target. This set is basically a copy of $\theta$ that is held fixed for some time $t$ and periodically gets swapped with the current parameter values in order to allow progress. Mnih et al. have shown that updating $\theta$ towards such fixed Q-targets is another effective way to stabilize the training process. In practice, this means that we have basically two separate networks which we will distinguish by their different set of parameters $\theta^{-}$ and $\theta$. We will refer to them respectively as Target- and Q-Network.

As a result, the Q-learning update of DQN at iteration $i$ uses the following loss function:

$$\begin{eqnarray} L_i(\theta_{i}) = \mathbb{E}_{(s,a,r,s') \sim U(\mathcal{D})} \Bigg[ \Bigg( r_{t+1} + \gamma \max_a \underbrace{Q(s_{t+1}, a; \theta^{-}_{i})}_{Target-Network} - \underbrace{Q(s_t, a_t;\theta_i)}_{Q-Network} \Big)^2 \Bigg] \end{eqnarray}$$

And thats it! We can use this update rule inside the Q-Learning algorithm to train a Deep Q-Network with SGD as we do in supervised learning. The corresponding Deep Q-Learning algorithm is given in the next part.


Implementation

As in the previous notebooks we will implement the DQN algorithm step by step. While the original DQN architecture was a CNN trained on Atari games, we will choose a much simpler problem and architecture. This way you can verify and debug your implementation much faster (in minutes vs hours...). However, the algorithm itself is still the same and extending it should be straightforward after completing the notebook. Though, this is left to the further ideas part depending on your time and motivation.

The following is an overview of all the parts you need. Use it as a checklist if you get lost. Like with Q-Learning, try first to verify that all the sub-parts are working as expected. If you are confident, integrate them iteratively into the main loop. There is no single best approach how to proceed so feel free to jump back and forth between the cells as you like.

Overview
  • The OpenAI Gym Environment
  • Replay Buffer
  • Epsilon Schedule
  • Deep Q-Network (and computation graph)
  • E-Greedy Policy (action selection)
  • Update the Target Network
  • Train Method
  • Main Loop

  • Evaluation of Deep Reinforcement Learning Algorithms

The Open AI Gym Environment

You will use the OpenAI Gym environment to solve a classic control task known as Cart Pole Balance. The great thing about the gym environment is that it offers a common interface to many different environments. That way you can easily test your algorithms on different tasks, e.g. switch from an easy one like CartPole to more challenging ones like an Atari game etc. ;)

For now, we will solve the CartPole-v0 task.

  1. First of all, go and read about its most important details such as the observations, actions, rewards, it's max length etc.
  2. Second, get used to the gym interface. Run a random agent for some episodes etc. The most important API calls are:
    • gym.make('CartPole-v0') returns a new game.
    • The game's action_space and observation_space variables.
    • reset() - returns an initial observation.
    • step() - takes an action int, returns an observation, reward, game_over, info tuple.
    • render() - renders the current game state.
    • close() - call this after the last episode has ended to clean up.
In [1]:
import gym
import numpy as np

game = gym.make("CartPole-v0")
observation_space = game.observation_space.shape[0]
action_space = game.action_space.n
i=0
while i in range(0,4):
    state = game.reset()
   
    state = np.reshape(state, [1, observation_space])
    #DEBUG print(state)
    x = 0
    while x in range(0,200):
        state_next, reward, terminal, info  = game.step(np.random.randint(0,action_space))
        #DEBUG print("State: ",state_next)
        #DEBUG print("Reward: ",reward)
        #DEBUG print("Terminate? ",terminal)
        #DEBUG print("Info: ",info)
        
        if terminal:
            #DEBUG print("END OF GAME!!!!!!!")
            x=201
        
        game.render()
        x+=1 
    i+=1
game.close()

Replay Buffer

The replay buffer should store the last $N$ experience tuples. This is basically a FIFO queue and practically, python offers such a data structure called deque. If initialized with a maxlen parameter, deque's append method will pop items from the left automatically when the list grows beyond the given maxlen. This is exactly what we want and you can implement it in just a few lines of code! The replay buffer should have the following methods:

  • __init__ constructor, initializing an internal deque with a given maxlen or $N$ or better, call it buffer_size.
  • add method, appending a new [state, action, reward, next_state, done] tuple. (done is the game_over information)
  • sample method, sample a random batch of training data of size batch_size. You can use random's sample method for that.

Use the cells below to test your implementation, e.g. by filling it with some integers in a loop, check whats in the queue and test the sampling method etc.

In [2]:
import random as r
from collections import deque

class ReplayBuffer():
    def __init__(self, buffer_size):
        self.buffer = deque([], maxlen=buffer_size)
    
    def add(self, state, action, reward, next_state, done):
        self.buffer.append([state, action, reward, next_state, done])
        #print(self.buffer)
    
    def sample(self, sample_size):
        return(r.sample(self.buffer, sample_size))
            
        
In [3]:
# TESTING the replay buffer
size = 3
replay_buffer = ReplayBuffer(size)
i = 0
while i < size+5:
    replay_buffer.add(1,2,3,4,i)
    i+=1
replay_buffer.sample(3)
    
Out[3]:
[[1, 2, 3, 4, 5], [1, 2, 3, 4, 6], [1, 2, 3, 4, 7]]

Epsilon Schedule

Last time, we calculated the current epsilon value inside the main loop. This time we need a little bit more control so let's create a class for that task. The reason for that is that we have to pre fill the replay buffer with some initial random experience before we can sample from it and start with the actual training. We want to control the amount of initial experience with a pre_train_steps variable. During this time, the schedule should return the start_epsilon value so that the agent behaves fully random. After that, the normal decay should be applied. The implementation needs two methods:

  • __init__ constructor, takes all hyper parameters for the schedule such as start_epsilon, final_epsilon, pre_train_steps, final_exploration_step, pre calculate the decay value per step here.
  • value method, takes a time step t and returns a correpsonding epsilon value. If t is smaller or greater than the pre_train_steps or final_exploration_step return the fixed values accordingly. In between calculate the decayed epsilon value at time t.

Use the code in the cell below to test and visualize your schedule.

In [4]:
import numpy as np

class LinearSchedule():
 
    def __init__(self, start_epsilon, final_epsilon, pre_train_steps, final_exploration_step):
    # You code comes here
        self.start_epsilon = start_epsilon
        self.final_epsilon = final_epsilon
        self.pre_train_steps = pre_train_steps
        self.final_exploration_step = final_exploration_step
        
        self.decay = (self.start_epsilon - self.final_epsilon) / (self.final_exploration_step - self.pre_train_steps)

    
    def value(self, t):      
        if t<=self.pre_train_steps:
            return self.start_epsilon
        else:
            if t>self.final_exploration_step:
                return self.final_epsilon
            else:
                current_epsilon = self.start_epsilon-self.decay*(t-self.pre_train_steps)
                return current_epsilon
In [5]:
# TESTING the epsilon schedule
%matplotlib inline
import matplotlib.pyplot as plt

schedule    = LinearSchedule(1.0, 0.1, 100, 1000)
test_points = [schedule.value(t) for t in range(1100)]

plt.plot(test_points)
Out[5]:
[<matplotlib.lines.Line2D at 0x7f7621d5d6d8>]

Deep Q-Network

The original DQN agent included a CNN as shown in the theory part of this notebook. For our task however, a simple MLP with only one hidden layer should be enough. Starting that simple will help you to get other implementation details right. Later on you can easily scale up and switch the MLP for a more powerfull network.

While the architecture will be easy, the DQN algorithm requires us to keep basically two separate networks, namely, a main Q-Network and a second Target-Network. In addition, Tensorflow requires us to keep a reference to every node we want to calculate with the sess.run command. For those reasons we will build a generic and reusable DQNetwork class and bind all graph nodes to the object. Later we can then simply use the instance objects to reference specifc nodes in a clean and readable way.

We will implement the DQNetwork class below in two steps. The first part denoted as basic Deep Q-Network, includes the actual network. The second part denote as Q-Learning Calculations, includes all the additional calculations to train the network.

Part 1 - Basic Deep Q-Network

This part is identical for the main Q- and the Target-Network. It includes all trainable variables of the graph! The model should be a fully connected feedforward network with one hidden layer of size $64$ and ReLU activations. For the CartPole task, the resulting MLP will consequently be later of size num_inputs=4, num_hidden=64, num_outputs=2. For the generic DQNetwork class however, let the user specify those values as parameters.

  • Now create a self.best_action node which should take the output layer (or q-values) and return the indice of the maximum action value. You can use tf.argmax for that. You can query this node in the e-greedy action selection later.

  • Create another node, self.max_q which does the same thing but returns the value of the maximum action value. You can use tf.reduce_max for that. You will need this node for the calculation of the TD-target later.

So far, everything should have been very straightforward.

Part 2 - Q-Learning Calculations

This part is different for the Q- and the Target-Network. It includes all the necessary calculations for training the network. The code skeleton below shows how to constrain the graph creation with two simple if statements.

First, the Target Stream. This part will later calculate the TD-Target (or $y_i$) with the following equation:

$$\begin{eqnarray} y_i = r_{t+1} + \gamma \max_a \underbrace{Q(s_{t+1}, a; \theta^{-}_{i})}_{Target-Network} \end{eqnarray}$$

Remember that we will train with mini-batches sampled from the replay buffer. For that reason, the reward values will be provided by mini-batches. This means we have to feed them externally, so let's create a placeholder node for those. Next, gamma will be a constant so let's create one in tensorflow and let the user specify it as a parameter at creation time of the network. Next, $s_{t+1}$ are the next_states from the buffer. This is not important here but you have to feed them correctly later! To get the maximum action value, use the self.max_q node that you have created earlier. Finally, there is a small detail we have not talked about yet. In the rare case that the next state is a final state, i.e the game is over, only the reward should be taken into account. To realize this here is a little trick: create another placeholder for the boolean done values from the mini-batch. Then, multiply the right-hand side of the equation with tf.abs(self.done - 1). For clarity call the final node which implements the equation above self.td_target!

Note that this is a lot of text but basically 4 lines of code!

Second, the Q Stream. This part will calculate the full TD-Error and optimize the mean squared error on the mini-batch.

$$\begin{eqnarray} L_i(\theta_{i}) = \mathbb{E}_{(s,a,r,s') \sim U(\mathcal{D})} \Bigg[ \Bigg( y_i - \underbrace{Q(s_t, a_t;\theta_i)}_{Q-Network} \Big)^2 \Bigg] \end{eqnarray}$$

Now first, create a placeholder for the td-target ($y_i$). These values have been calculated by the Target-Network and we must feed them again to the main Q-Network. Yes, its really two separate networks even though they share some code here. Next we need the value of $Q(s_t, a_t;\theta_i)$. We can query all action values by feeding the mini-batch of states ($s_t$) from the buffer to the network (first part of the network). The problem is that we only want to select the action value of the action that was actually taken. Luckily we have this information as part of the mini-batch. We can feed it with an additional placeholder, let's call it self.actions. The idea is now to mask the output of the network with a one-hot encoded representation of the action indices from the placeholder. To do this use tf.one_hot and call the resulting node self.actions_onehot or similar. Finally multiply the one_hot vector with the output of the network and call tf.reduce_sum on the result to obtain a clean list of the action values we want. The resulting node, let's call it self.Q holds the values of $Q(s_t, a_t;\theta_i)$ for the complete mini-batch.

If you feel uncomfortable with this trick create a toy example in a separate cell in order to understand whats going on here.

Great, now the rest should be straightforward again. Square the difference of td_target and Q to obtain the TD-error and use tf.reduce_mean on that to obtain the expected loss of the mini-batch. This corresponds to $\mathbb{E}_{(s,a,r,s') \sim U(\mathcal{D})}$ in the formula. Create an optimizer node such as the AdamOptimizer and again, let the user specifiy the learnin rate as a parameter at creation time of the network. Finally minimize the loss and call this final node something like self.updateModel or self.train.

Note that this is again a lot of text but only something like ~8 lines of code!

In [6]:
import tensorflow as tf

import tensorflow as tf

class DQNetwork():
    
    def __init__(self, scope, num_inputs, num_hidden, num_outputs, gamma, learning_rate):

        self.scope = scope
        
        with tf.variable_scope(self.scope):
            
            # ---------------------
            # Basic Deep Q-Network
            # ---------------------
            # You code comes here     
            self.state = tf.placeholder(tf.float32, shape=[None,num_inputs], name="StatePlaceholder")
                 
            self.hidden = tf.layers.dense(inputs=self.state, units=num_hidden, activation=tf.nn.relu, name="hiddenLayer")
                 
            self.logits = tf.layers.dense(inputs=self.hidden, units=num_outputs, name="logits")
            
            self.best_action = tf.argmax(self.logits, name="best_action", axis=1)
            
            self.max_q = tf.reduce_max(self.logits)
                 
             
            
            # ------------------------
            # Q-Learning Calculations
            # ------------------------
               
            if scope == 'Target':
                # You code comes here  
                self.reward_placeholder = tf.placeholder(tf.float32, shape=None, name="RewardPlaceholder")
                
                self.gamma = tf.constant(value=gamma, name="GammaConstant")
                
                self.done_placeholder = tf.placeholder(tf.int32, shape = None, name="DonePlaceholder")
                
                self.td_target = self.reward_placeholder + ((gamma*self.max_q)*tf.cast(tf.abs(self.done_placeholder - 1), tf.float32))
                
                
              
                
            if scope == 'Q':
                # You code comes here
                self.td_target_placeholder = tf.placeholder(tf.float32, shape=None, name="TD-TargetPlaceholder")
                
                self.actions = tf.placeholder(tf.uint8, shape=None, name="ActionTakenPlaceholder")
                
                self.actions_onehot = tf.one_hot(self.actions, depth=num_outputs)
                
                self.masked_logits = tf.multiply(self.actions_onehot, self.logits)
                
                self.Q = tf.reduce_sum (self.masked_logits)
                
                self.td_error = self.td_target_placeholder - self.Q
                
                self.expected_loss = tf.reduce_mean(self.td_error**2)
                
                self.optimizer = tf.train.AdamOptimizer(learning_rate, name="AdamOptimizer")
                
                self.update_model = self.optimizer.minimize(self.expected_loss, name="minimize_loss")
                
    
In [7]:
# A cell for testing

E-Greedy Policy

As in the Q-Learning notebook, let us encapsulate the action selection into a separate method. This time however, selecting a greedy max action requires us to perform a forward pass through the Q-Network. The method per se remains as simple as in the Q-Learning case. In order to perform the forward pass of the network though, we have to hand over a reference to the current Tensorflow session object from the main loop etc.

Note that the sess.run call will most certainly return a list, containing a single action indice. Make sure to unpack it properly before returning it. You can test this in separate cell or in the main loop by using a fixed epsilon value.

In [8]:
import numpy as np


def choose_egreedy_action(session, current_state, network, epsilon):

    # Create Probability variable for choosing a random action (first Value = exploration) and a learned action (second Value = exploitation)
    probability = [epsilon, 1-epsilon]
    
    #DEBUG print("Probability: ", probability)
    
    random_method = [0,1]
    
    #np.random.choice chooses one ethod according to the probability
    action_choice = np.random.choice(random_method, p=probability)
    
    # Exploration Choice | action_choice = 0 | A High Epsilon value results in a high probability for exploring action
    if action_choice == 0:
        # Choose a random Action out of two possibilities (Action 0 and Action 1)
        action = np.random.choice(2)
     
    # Exploitation Choice | action_choice = 1 | A low Epsilon value results in a high probability for exploiting action
    else:
        # Reshape current state to fit for the Network (the first dimesion for batch size)
        current_state = np.reshape(current_state, [1, observation_space])
        
        # Get an action by performing a forward pass through the network. 
        # max_q and logits were also calculated for DEBUGGING reasons
        action, max_q, logits = session.run([network.best_action, network.max_q, network.logits], feed_dict={network.state:current_state})
        #DEBUG: print("Max_q: ",max_q)
        #DEBUG: print("Logits: ",logits)
        
        # the network returned a one dimensional array with the best predicted action stored. The next line 
        action = action[0]
    #DEBUG: print(action_choice, " : ",action)    
    return action
    
RAW CELL!!! tf.reset_default_graph() QNetwork = DQNetwork("Q1", 2, 64, 4, 0.9, 0.1) init = tf.global_variables_initializer() with tf.Session() as sess: sess.run(init) states = [] states.append((0,4)) action = choose_egreedy_action(session=sess, current_state=states, network=QNetwork, epsilon=0.001) print(action)

Update the Target Network

As explained in the theory part, the Target-Network will be fixed for some time $C$ while the main Q-Network gets update every training/update step. Every $C$ time steps however we want to switch the networks or better, update the Target-Network with the latest information from the Q-Network. This basically means that we want to copy over all weights from the Q-Network and assign them to the Target-Network. The Q-Network itself remains unchanged. We will control this freeze frequency later inside the main loop and execute the copy process only every $C$ time steps.

Assigning new values to variables in Tensorflow can be done with the tf.assign method. But, as everything in TensorFlow, these assign operations will be tensor nodes and we have to create them before we start the session. Since they don't belong to any of the networks let us do this in an extra method.

Note that in (before) the main loop you will have to create both networks first and then hand them to the get_update_target_ops method to obtain the list of assign operations!

  • get all trainable variables of a network with tf.trainable_variables(scope=).
  • better sort the lists using sorted and the attrgetter helper, e.g Q_vars = sorted(Q_vars, key=attrgetter('name')).
  • create an empty list for the assign expressions, let's call them something like update_target_expr.
  • loop over the variable lists and create assign opertations, e.g. t_var.assign(q_var). Append them to the expression list.
  • a handy way for iterating over two lists is zip in a for loop. See the cell below for a little demo.
  • return the list of assign operations. You can later simply call sess.run(update_target_expr) to run all of the assign operations.

Use the cells below to test your implementation with some toy networks!

In [9]:
# Zip demo
x = [1,2,3]
y = [4,5,6]

for x,y in zip(x,y):
    print(x,y)
1 4
2 5
3 6
In [10]:
from operator import attrgetter

def get_update_target_ops(Q_network, Target_network):

    # You code comes here     
    # 1. get the trainable variables per network
    q_train_vars = tf.trainable_variables(scope=Q_network.scope)
    target_train_vars = tf.trainable_variables(scope=Target_network.scope)
    
    # 2. sort them with sorted(list, key=attrgetter())
    q_train_vars = sorted(q_train_vars, key=attrgetter("name"))
    target_train_vars = sorted(target_train_vars, key=attrgetter("name"))
    
    #print(q_train_vars)
    # 3.create a new list to append all assign operations
    update_target_expr = []
    
    
    for q_var, t_var in zip (q_train_vars, target_train_vars):
        # create an individual assign task for each step in the loop
        assign_task = t_var.assign(q_var)
        
        # append the assign operations to the update target expression list
        update_target_expr.append(assign_task)

    return update_target_expr
In [11]:
# TESTING get_update_target_ops with some toy networks
tf.reset_default_graph()
Q1 = DQNetwork("Q1", 1,2,1, 0.9, 0.1) 
Q2 = DQNetwork("Q2", 1,2,1, 0.9, 0.1)

Q1_vars = tf.trainable_variables(scope="Q1")
Q2_vars = tf.trainable_variables(scope="Q2")
update_target_expr = get_update_target_ops(Q1, Q2)

print("List of created Assign operations"), print(update_target_expr)
print("\n Q1 Variables"), print(Q1_vars)
print("\n Q2 Variables"), print(Q2_vars)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())        

    print("\n Q1 Values"), print(sess.run(Q1_vars))
    print("\n Q2 Values"), print(sess.run(Q2_vars))
    
    sess.run(update_target_expr)

    print("\n Q2 Values AFTER network copy. Should now be identical to the values of Q1")
    print(sess.run(Q2_vars))  
WARNING:tensorflow:From <ipython-input-6-18e03700f04d>:19: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
WARNING:tensorflow:From /home/pia4/.direnv/python-3.7.3rc1/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
List of created Assign operations
[<tf.Tensor 'Assign:0' shape=(2,) dtype=float32_ref>, <tf.Tensor 'Assign_1:0' shape=(1, 2) dtype=float32_ref>, <tf.Tensor 'Assign_2:0' shape=(1,) dtype=float32_ref>, <tf.Tensor 'Assign_3:0' shape=(2, 1) dtype=float32_ref>]

 Q1 Variables
[<tf.Variable 'Q1/hiddenLayer/kernel:0' shape=(1, 2) dtype=float32_ref>, <tf.Variable 'Q1/hiddenLayer/bias:0' shape=(2,) dtype=float32_ref>, <tf.Variable 'Q1/logits/kernel:0' shape=(2, 1) dtype=float32_ref>, <tf.Variable 'Q1/logits/bias:0' shape=(1,) dtype=float32_ref>]

 Q2 Variables
[<tf.Variable 'Q2/hiddenLayer/kernel:0' shape=(1, 2) dtype=float32_ref>, <tf.Variable 'Q2/hiddenLayer/bias:0' shape=(2,) dtype=float32_ref>, <tf.Variable 'Q2/logits/kernel:0' shape=(2, 1) dtype=float32_ref>, <tf.Variable 'Q2/logits/bias:0' shape=(1,) dtype=float32_ref>]

 Q1 Values
[array([[-1.1509987 ,  0.48768115]], dtype=float32), array([0., 0.], dtype=float32), array([[ 0.6257218 ],
       [-0.21639645]], dtype=float32), array([0.], dtype=float32)]

 Q2 Values
[array([[0.29460037, 0.05605805]], dtype=float32), array([0., 0.], dtype=float32), array([[ 0.3567407],
       [-1.3840243]], dtype=float32), array([0.], dtype=float32)]

 Q2 Values AFTER network copy. Should now be identical to the values of Q1
[array([[-1.1509987 ,  0.48768115]], dtype=float32), array([0., 0.], dtype=float32), array([[ 0.6257218 ],
       [-0.21639645]], dtype=float32), array([0.], dtype=float32)]

Train Method

Finally let's create a method to run the actual network training. Doing this in an extra method is not necessary per se but un-clutters the main loop a lot. The implementation requires the following three steps:

  1. Sample a mini-batch of experience from the replay buffer. In order to feed everything to the DQNetworks, the mixed batch must be reshaped into separate list batches ([]!) of observations, actions, rewards, next_observations and done. Make sure this true before creating corresponding feed_dicts! Depending on your replay buffer implementation you way find it handy to use zip again and apply the list operator with the map function, e.g. map(list, zip(*batch)). See the cell below for an UnZip demo.
  2. Prepare an appropriate feed_dict and get the td_target values by running the Target-Network.
  3. Prepare an appropriate feed_dict and run the update_model or train operation of the main Q-Network.

Now we can simply call train inside the main loop every time we want to train the Q-Network.

In [12]:
# UnZip demo *
mini_batch = [[1,2,3], [1,2,3], [1,2,3]]
    
for i in zip(*mini_batch):
    print(i)
(1, 1, 1)
(2, 2, 2)
(3, 3, 3)
RAW CELL!!! game = gym.make("CartPole-v0") rb = ReplayBuffer(3) action_space = game.action_space.n a = 0 for a in range(2): state = game.reset() x = 0 while x in range(0,10): random_action = np.random.randint(0,action_space) state_next, reward, terminal, info = game.step(random_action) #print("added") rb.add(state, random_action, reward, state_next, terminal) #print(state) game.render() x+=1 state = state_next a+=1 game.close() #print(rb) samp = rb.sample(2) separate_lists = list(map(list, zip(*samp))) print(separate_lists[3])
In [13]:
def train(sess, Q, Target, buffer, batch_size):

    # You code comes here     
    # 1. Sample from the replay buffer
    sample = buffer.sample(batch_size)
    
    separate_lists = list(map(list, zip(*sample)))
    
    # save each element of the separate list into their own list
    observations = separate_lists[0]
    actions = separate_lists[1]
    rewards = separate_lists[2]
    next_observations = separate_lists[3]
    done = separate_lists[4]
    
    
    # 2. Retrieve the TD-target values from the Target-Network
    # For each Batch in BatchSize
    for i in range(batch_size):
        
        # Reshape to fit Network
        next_observations[i] = np.reshape(next_observations[i], [1, observation_space])
        # Define a Feed Dictionary for td_target
        ## Feeding next_observation, rewards, done into Network
        td_target_feed = {Target.state:next_observations[i], Target.reward_placeholder:rewards[i], Target.done_placeholder:done[i]}

        
        td_target = sess.run(Target.td_target, feed_dict=td_target_feed)

        # 3. Perform a training step by running the main Q-Network

        # Reshape to fit Network
        observations[i] = np.reshape(observations[i], [1, observation_space])
        sess.run(Q.update_model, feed_dict={Q.state:observations[i], Q.td_target_placeholder: td_target, Q.actions:actions[i]})
    
    

Main Loop

Below you can see the Deep Q-Learning pseudo code from the original paper. This will help you to get at least the main parts of the algorithm right and should serve as a rough blue print of when to do what inside the loop. In practice, there are many more subtle details which are not mentioned explicitly. For that reason we provide you an additional checklist below the algorithm so you don't miss anything important. As in the last notebook we strongly recommend to start simple and work through the steps one by one! In addition, the checklist provides some sane default ranges for the hyperparameters to get you started. Try different settings and make it work :) If you have problems, better start with huge values and reduce them step by step as needed!

Hint: Personally I like to put the (hyperparameters and the preparation stuff) and the actual loop in separate code cells. You don't have to follow this convention. If you prefer one huge code cell, that's fine. Just do what works best for you!


Deep Q-Learning with experience replay
  • Initialize replay memory $D$ to capacity $N$
  • Initialize action-value function $Q$ with random weights $\theta$
  • Initialize target action-value function $\hat{Q}$ with random weights $\theta^{-}$

  • For $t = 1, T$ do

    • With probability $\epsilon$ select a random action $a_t$
    • otherwise select $a_t = \text{arg}\max_a Q(s_t,a;\theta)$

    • Execute action $a_t$ in emulator and observe reward $r_t$ and state $s_{t+1}$
    • Store transition $(s_t,a_t,r_t,s_{t+1})$ in $D$
    • Sample random minibatch transitions $(s_j,a_j,r_j,s_{j+1})$ from $D$
    • set
      $ y_j = \begin{cases}

      r_j  & \text{if episode terminates at step } j + 1 \\               
      r_j + \gamma \max_a \hat{Q}(s_{j+1}, a; \theta^{-})  & \text{otherwise}
      

      \end{cases} $

    • Perform a gradient descent step on $\big(y_j - Q(s_j,a_j;\theta)\big)^2$ with respect to the network parameters $\theta$

    • Every $C$ steps reset $\hat{Q} = Q$
  • End For

Deep Q-Learning - PIA checklist

This is a helpful checklist without any claim to completeness. Depending on your implementation you may add, change or remove parameters as you like!

Preparation and hyper parameters

  • Epsilon Schedule

    • start_epsilon $1$
    • final_epsilon $\in \{0.02,0.1\}$
    • pre_training_steps $\sim[32,...,10000]$
    • final_exploration_step $\sim [100,...,40000]$
  • Replay Buffer

    • buffer_size $N \in \{32,100,500,1000,10000,50000, ...?\}$ (Bigger is better but try small ones too!)
  • Training

    • total/max time steps T $\in \{10k,20k,30k,40k,100k\}$
    • training_freq $1$ - train the Q-Network only every $n$ steps. For now just use 1 as default.
    • switch_networks $C \sim 500$
    • gamma $\in \{0.9, 0.99, 1\}$
    • batch_size $32$
    • learning_rate $0.001$
  • Model

    • always call tf.reset_default_graph() before creating a new graph
    • get the observation_space and action_space from the game env
    • num_hidden $64$
    • create a Q_network and T_network with scope "Q" and "Target"
    • save the result from get_update_target_ops to something like update_target_network

Inside the Loop

  • Use the pseudo code as a guideline
  • Remember to train and switch the networks only after t becomes t > pre_training_steps
  • Maybe obvious but remember to set observation = new_observation for $t+1$

If the Loop runs without errors

  • Create insight

    • keep track of the episode rewards, calculate a 10 mean, 100 mean
    • print some info every $n'th \sim 2000$ time step, e.g. current step, epsilon, mean reward etc.
    • plot the epsilon schedule vs. the reward
  • Save/Load the model - needed for the test evaluation later

    • saver = tf.train.Saver() - outside the session
    • saver.save(sess, "./some_path/model.ckpt") - with an active session
    • saver.restore(sess, "./some_path/model.ckpt") - with a new session. In this case tf.global_variables_initializer() is not required. A fitting graph definition however is. If there is none left in ram, e.g. if the kernel was restartet, make sure you recreate the graph definition before restoring (at least the main Q-Network). You will need it anyway to reference nodes in the sess.run calls later.
In [14]:
import gym
import tensorflow as tf

# Create a new game
game = gym.make('CartPole-v0')
/home/pia4/.direnv/python-3.7.3rc1/lib/python3.7/site-packages/gym/envs/registration.py:17: PkgResourcesDeprecationWarning: Parameters to load are deprecated.  Call .resolve and .require separately.
  result = entry_point.load(False)
In [15]:
# Your code comes here
tf.reset_default_graph()

# Hyperparameters:
## Epsilon Schedule
start_epsilon = 1.0 
final_epsilon = [0.02 , 0.1] 
pre_training_steps = [700,5000]



## Replay Buffer
buffer_size = [500,10000]

## Training
### Max Trainsteps T
T = [10000,35000]
### Currently use 1 as Train Frequency 1 = 100% of steps are being trained
train_freq = 1
### Update the Target Network every 500 steps
switch_networks = 500
### Set Gamma
gamma = [0.9, 0.99]
### Set Batch Size
batch_size = 32
### Set Learning Rate
learn_rate = 0.001

##Model
hidden_layers = 64
In [16]:
#Create different Hyperparameter filled lists
hyperparameter_list = []

for m_final_epsilon, m_pretrain_steps, m_buffersize, m_T, m_gamma in [(m_final_epsilon, m_pretrain_steps, m_buffersize, m_T, m_gamma) for m_final_epsilon in final_epsilon for m_pretrain_steps in pre_training_steps for m_buffersize in buffer_size for m_T in T for m_gamma in gamma]:
    current_list = [m_final_epsilon, m_pretrain_steps, m_buffersize, m_T, m_gamma]
    hyperparameter_list.append(current_list)
In [17]:
# Plotting Values
epsilon_plot = []
episode_length_plot = []
episode_reward_plot = []
model_name = []
In [18]:
#To see if the GPU is being used or not
tf.test.is_gpu_available(
    cuda_only=False,
    min_cuda_compute_capability=None
)
Out[18]:
False
In [19]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import itertools

def plot_model(epsilon_plot, episode_length_plot, episode_reward_plot, plot_name):
    
    sns.set_style("darkgrid")
    pallete = sns.color_palette()

    epsilon_percentage=[]
    
    for i,epsi in enumerate(epsilon_plot):
        # Multiply with 100 - to show percentage
        epsilon_percentage.append(epsi*100)


    mean_episode_length_plot = [np.mean(episode_length_plot) for i in range(len(episode_length_plot))]
    mean_episode_reward_plot = [np.mean(episode_reward_plot)  for i in range(len(episode_length_plot))]

    fig = plt.figure(figsize=(25,8))

    # Plot return per episode  
    ax1 = fig.add_subplot(1,2,1)
    ax1.plot(episode_reward_plot, color=pallete[0])
    ax1.plot(mean_episode_reward_plot, color=pallete[1])
    ax1.plot(epsilon_percentage, color=pallete[2])
    ax1.legend(['Return','Mean','Epsilon'])
    ax1.set_title("{}".format(plot_name), fontsize=14)
    plt.ylabel("Return")
    plt.xlabel("Episode")

    #Saving Images
    #plt.savefig('./img/{}.png'.format(plot_name))
    # Show the plot
    plt.tight_layout()
    plt.show()
In [20]:
from tqdm import tqdm

current_name = ""


for current_index, current_hyperlist in enumerate(hyperparameter_list):
    current_final_epsilon = current_hyperlist[0]
    current_pretrain_steps = current_hyperlist[1]
    current_buffersize = current_hyperlist[2]
    current_T = current_hyperlist[3]
    current_gamma = current_hyperlist[4]

    
    
    temp_c_e = start_epsilon
    final_exploration_step = [current_pretrain_steps*2, current_pretrain_steps*4]

    for exploration_index,final_expl_step in enumerate(final_exploration_step):
        
        current_epsilon_plot = []
        current_episode_length_plot = []
        current_episode_reward_plot = []

        current_final_exploration_step = final_expl_step

        current_name = "final-epsiolon_{0}|pretrain-steps_{1}|final-exploration-step_{2}|buffersize_{3}|T_{4}|gamma_{5}]".format(
                                                                                                      current_final_epsilon, 
                                                                                                      current_pretrain_steps,
                                                                                                      current_final_exploration_step,
                                                                                                      current_buffersize,
                                                                                                      current_T,
                                                                                                      current_gamma)
        print_index = current_index*2+exploration_index+1
        print_all = len(hyperparameter_list)*len(final_exploration_step)
        print(print_index,"/",print_all,": ",current_name)
        
        
        
        tf.reset_default_graph()
        
        

        with tf.Session() as sess:
            
            current_epsilon_schedule = LinearSchedule(start_epsilon, current_final_epsilon, current_pretrain_steps, current_final_exploration_step)

            replay = ReplayBuffer(current_buffersize)
            Q_network = DQNetwork("Q",
                              game.observation_space.shape[0],
                              hidden_layers,
                              game.action_space.n,
                              current_gamma,
                              learn_rate)

            Target_network = DQNetwork("Target",
                                   game.observation_space.shape[0],
                                   hidden_layers,
                                   game.action_space.n,
                                   current_gamma,
                                   learn_rate)

            update_target_network = get_update_target_ops(Q_network,Target_network)

            sess.run(tf.global_variables_initializer())
            saver = tf.train.Saver(save_relative_paths=True)   

            state = game.reset()
            state = np.reshape(state, [1, game.observation_space.shape[0]])

            episode_steps = 0
            reward_in_episode = 0
            # Your code comes here
            for t in tqdm(range(current_T)):
                episode_steps+=1
                
                current_epsilon = current_epsilon_schedule.value(t)
                
                """
                DEBUGGING
                if temp_c_e < current_epsilon:
                    print("Old: ", temp_c_e)
                    print("New: ", current_epsilon)
                    print("t: ",t)
                    print("Start epsilon: ",start_epsilon)
                    print("Final Epsilon: ",current_final_epsilon)
                    print("Pretrain Steps", current_pretrain_steps)
                    print("final Exploration Step", current_final_exploration_step)
                """
                
                temp_c_e = current_epsilon

                
                    
                    
                #DEBUG: print("current epsilon: ",current_epsilon)
                action = choose_egreedy_action(session=sess, current_state=state, network=Q_network, epsilon=current_epsilon)
                #DEBUG: print("action: ", action)
                state_next, reward, terminal, info = game.step(action)
                reward_in_episode+=reward
                #DEBUG: print(episode, "-", t, state_next, reward, terminal, info)
                replay.add(state, action, reward, state_next, terminal)

                if terminal:
                    #DEBUG: print("End of Game")
                    current_epsilon_plot.append(current_epsilon)
                    current_episode_length_plot.append(episode_steps)
                    current_episode_reward_plot.append(reward_in_episode)
                    state = game.reset()
                    state = np.reshape(state, [1, game.observation_space.shape[0]])
                    episode_steps = 0
                    reward_in_episode = 0
                    

                else:
                    if t > current_pretrain_steps:
                        # Train and Update Target Network Variables
                        #DEBUG: print("Training")
                        train(sess,Q_network,Target_network,replay,batch_size)

                        if(t%switch_networks==0):
                            #DEBUG print("update Target network")
                            sess.run(update_target_network)

                    state = state_next

            #saver.save(sess, "./models/{0}.ckpt".format(current_name))
            epsilon_plot.append(current_epsilon_plot)
            episode_length_plot.append(current_episode_length_plot)
            episode_reward_plot.append(current_episode_reward_plot)
            model_name.append(current_name)
            plot_model(current_epsilon_plot,current_episode_length_plot,current_episode_reward_plot,current_name)
            
            saver.save(sess, 'models/{0}.ckpt'.format(current_name))
1 / 64 :  final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.9]
WARNING:tensorflow:From /home/pia4/.direnv/python-3.7.3rc1/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
100%|██████████| 10000/10000 [04:44<00:00, 31.76it/s]
2 / 64 :  final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.9]
100%|██████████| 10000/10000 [04:47<00:00, 34.76it/s]
3 / 64 :  final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.99]
100%|██████████| 10000/10000 [04:46<00:00, 34.89it/s]
4 / 64 :  final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.99]
100%|██████████| 10000/10000 [04:45<00:00, 31.58it/s]
5 / 64 :  final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.9]
100%|██████████| 35000/35000 [17:45<00:00, 32.86it/s]
  0%|          | 0/35000 [00:00<?, ?it/s]
6 / 64 :  final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.9]
100%|██████████| 35000/35000 [17:52<00:00, 32.57it/s]
7 / 64 :  final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.99]
100%|██████████| 35000/35000 [17:42<00:00, 32.94it/s]
8 / 64 :  final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.99]
100%|██████████| 35000/35000 [17:44<00:00, 32.87it/s]
9 / 64 :  final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.9]
100%|██████████| 10000/10000 [04:51<00:00, 34.29it/s]
10 / 64 :  final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.9]
100%|██████████| 10000/10000 [04:53<00:00, 34.08it/s]
11 / 64 :  final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.99]
100%|██████████| 10000/10000 [04:50<00:00, 31.21it/s]
12 / 64 :  final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.99]
100%|██████████| 10000/10000 [04:53<00:00, 34.06it/s]
13 / 64 :  final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.9]
100%|██████████| 35000/35000 [17:56<00:00, 32.51it/s]
14 / 64 :  final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.9]
100%|██████████| 35000/35000 [18:04<00:00, 32.28it/s]
15 / 64 :  final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.99]
100%|██████████| 35000/35000 [17:56<00:00, 32.50it/s]
16 / 64 :  final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.99]
100%|██████████| 35000/35000 [17:53<00:00, 32.62it/s]
17 / 64 :  final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.9]
100%|██████████| 10000/10000 [02:33<00:00, 65.18it/s]  
18 / 64 :  final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.9]
100%|██████████| 10000/10000 [02:32<00:00, 65.37it/s]  
19 / 64 :  final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.99]
100%|██████████| 10000/10000 [02:34<00:00, 32.25it/s]  
20 / 64 :  final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.99]
100%|██████████| 10000/10000 [02:33<00:00, 65.30it/s]  
21 / 64 :  final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.9]
100%|██████████| 35000/35000 [15:37<00:00, 31.93it/s]  
22 / 64 :  final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.9]
100%|██████████| 35000/35000 [15:33<00:00, 37.51it/s]  
23 / 64 :  final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.99]
100%|██████████| 35000/35000 [15:40<00:00, 31.91it/s]  
  6%|â–‹         | 2191/35000 [00:00<00:01, 21904.54it/s]
24 / 64 :  final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.99]
100%|██████████| 35000/35000 [15:39<00:00, 37.27it/s]  
25 / 64 :  final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.9]
100%|██████████| 10000/10000 [02:35<00:00, 30.71it/s]  
26 / 64 :  final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.9]
100%|██████████| 10000/10000 [02:32<00:00, 31.47it/s]  
27 / 64 :  final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.99]
100%|██████████| 10000/10000 [02:29<00:00, 66.86it/s]  
28 / 64 :  final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.99]
100%|██████████| 10000/10000 [02:28<00:00, 67.47it/s]  
29 / 64 :  final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.9]
100%|██████████| 35000/35000 [15:28<00:00, 37.69it/s]  
30 / 64 :  final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.9]
100%|██████████| 35000/35000 [15:46<00:00, 36.98it/s]  
31 / 64 :  final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.99]
100%|██████████| 35000/35000 [15:45<00:00, 37.03it/s]  
32 / 64 :  final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99]
100%|██████████| 35000/35000 [15:47<00:00, 36.96it/s]  
33 / 64 :  final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.9]
100%|██████████| 10000/10000 [04:47<00:00, 34.79it/s]
34 / 64 :  final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.9]
100%|██████████| 10000/10000 [04:47<00:00, 34.78it/s]
  0%|          | 0/10000 [00:00<?, ?it/s]
35 / 64 :  final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.99]
100%|██████████| 10000/10000 [04:49<00:00, 34.57it/s]
36 / 64 :  final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.99]
100%|██████████| 10000/10000 [04:49<00:00, 34.49it/s]
37 / 64 :  final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.9]
100%|██████████| 35000/35000 [17:47<00:00, 32.78it/s]
  0%|          | 0/35000 [00:00<?, ?it/s]
38 / 64 :  final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.9]
100%|██████████| 35000/35000 [17:54<00:00, 32.57it/s]
39 / 64 :  final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.99]
100%|██████████| 35000/35000 [17:53<00:00, 32.61it/s]
40 / 64 :  final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.99]
100%|██████████| 35000/35000 [17:53<00:00, 32.60it/s]
41 / 64 :  final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.9]
100%|██████████| 10000/10000 [04:49<00:00, 34.60it/s]
42 / 64 :  final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.9]
100%|██████████| 10000/10000 [04:51<00:00, 34.25it/s]
43 / 64 :  final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.99]
100%|██████████| 10000/10000 [04:51<00:00, 34.25it/s]
44 / 64 :  final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.99]
100%|██████████| 10000/10000 [04:53<00:00, 31.16it/s]
45 / 64 :  final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.9]
100%|██████████| 35000/35000 [18:02<00:00, 31.63it/s]
46 / 64 :  final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.9]
100%|██████████| 35000/35000 [17:57<00:00, 32.49it/s]
47 / 64 :  final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.99]
100%|██████████| 35000/35000 [18:03<00:00, 31.67it/s]
48 / 64 :  final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.99]
100%|██████████| 35000/35000 [18:01<00:00, 31.61it/s]
49 / 64 :  final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.9]
100%|██████████| 10000/10000 [02:34<00:00, 64.88it/s]  
50 / 64 :  final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.9]
100%|██████████| 10000/10000 [02:33<00:00, 65.25it/s]  
51 / 64 :  final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.99]
100%|██████████| 10000/10000 [02:33<00:00, 65.03it/s]  
52 / 64 :  final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.99]
100%|██████████| 10000/10000 [02:33<00:00, 65.22it/s]  
53 / 64 :  final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.9]
100%|██████████| 35000/35000 [15:39<00:00, 37.24it/s]  
54 / 64 :  final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.9]
100%|██████████| 35000/35000 [15:32<00:00, 37.54it/s]  
55 / 64 :  final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.99]
100%|██████████| 35000/35000 [15:38<00:00, 37.29it/s]  
56 / 64 :  final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.99]
100%|██████████| 35000/35000 [15:35<00:00, 37.41it/s]  
57 / 64 :  final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.9]
100%|██████████| 10000/10000 [02:35<00:00, 31.20it/s]  
58 / 64 :  final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.9]
100%|██████████| 10000/10000 [02:33<00:00, 65.24it/s]  
59 / 64 :  final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.99]
100%|██████████| 10000/10000 [02:36<00:00, 64.06it/s]  
  0%|          | 0/10000 [00:00<?, ?it/s]
60 / 64 :  final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.99]
100%|██████████| 10000/10000 [02:34<00:00, 64.79it/s]  
61 / 64 :  final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.9]
100%|██████████| 35000/35000 [15:43<00:00, 37.10it/s]  
  6%|â–‹         | 2205/35000 [00:00<00:01, 22049.65it/s]
62 / 64 :  final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.9]
100%|██████████| 35000/35000 [15:47<00:00, 36.93it/s]  
63 / 64 :  final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.99]
100%|██████████| 35000/35000 [15:43<00:00, 37.09it/s]  
64 / 64 :  final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99]
100%|██████████| 35000/35000 [15:44<00:00, 37.04it/s]  

Plot statistics

Similarily to the Random Buffer, we created a class Mean Buffer to create a limited List of Mean Values with the size buffer_size. For example a Mean 10 Score can be calculated by creating a MeanBuffer(10) After this MeanBuffer contains 10 values, the next value added drops the first item in the list. This way we can create an updated List of buffer_size values.

In [36]:
import random as r
from collections import deque

class MeanBuffer():
    def __init__(self, buffer_size):
        self.buffer = deque([], maxlen=buffer_size)
    
    def add(self, reward):
        self.buffer.append(reward)

    def length(self):
        return(len(self.buffer))
    
    def sample(self):
        return(self.buffer)
In [55]:
# You code comes here
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import itertools

for index,model in enumerate(model_name):
    
    print(model)
    
    sns.set_style("darkgrid")
    pallete = sns.color_palette()

    epsilon_percentage=[]
    
    for i,epsi in enumerate(epsilon_plot[index]):
        # Multiply with 100 - to show percentage
        epsilon_percentage.append(epsi*100)


    mean_episode_length_plot = [np.mean(episode_length_plot[index]) for i in range(len(episode_length_plot[index]))]
    mean_episode_reward_plot = [np.mean(episode_reward_plot[index])  for i in range(len(episode_length_plot[index]))]
    
    mean_reward_ten_buffer = MeanBuffer(10)
    mean_reward_hundred_buffer = MeanBuffer(100)
    
    mean_reward_ten_list = []
    mean_reward_hundred_list = []
    
    for reward_index, reward in enumerate(episode_reward_plot[index]):
        mean_reward_ten_buffer.add(reward)
        mean_reward_hundred_buffer.add(reward)
        
        #DEBUG print(sum(mean_reward_ten_buffer.sample()))
        
        mean_reward_ten_list.append(sum(mean_reward_ten_buffer.sample())/mean_reward_ten_buffer.length())
            
        mean_reward_hundred_list.append(sum(mean_reward_hundred_buffer.sample())/mean_reward_hundred_buffer.length())
        
        
    
    
    
    
    
    fig = plt.figure(figsize=(25,8))

    # Plot return per episode  
    ax1 = fig.add_subplot(1,2,1)
    ax1.plot(episode_reward_plot[index], color=pallete[0], linewidth=0.5)
    ax1.plot(mean_episode_reward_plot, color=pallete[1])
    ax1.plot(mean_reward_ten_list, color=pallete[2], linewidth=3)
    ax1.plot(mean_reward_hundred_list, color=pallete[3], linewidth=3)
    ax1.plot(epsilon_percentage, color=pallete[4])
    ax1.legend(['Return','Mean','Mean-10','Mean-100','Epsilon'])
    ax1.set_title("Return per Episode", fontsize=14)
    plt.ylabel("Return")
    plt.xlabel("Episode")

    plt.savefig('./img/{}.png'.format(model))
    # Show the plot
    plt.tight_layout()
    plt.show()
final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.9]
final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.9]
final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.99]
final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.99]
final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.9]
final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.9]
final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.99]
final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.99]
final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.9]
final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.9]
final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.99]
final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.99]
final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.9]
final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.9]
final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.99]
final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.99]
final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.9]
final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.9]
final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.99]
final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.99]
final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.9]
final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.9]
final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.99]
final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.99]
final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.9]
final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.9]
final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.99]
final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.99]
final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.9]
final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.9]
final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.99]
final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99]
final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.9]
final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.9]
final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.99]
final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.99]
final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.9]
final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.9]
final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.99]
final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.99]
final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.9]
final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.9]
final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.99]
final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.99]
final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.9]
final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.9]
final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.99]
final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.99]
final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.9]
final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.9]
final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.99]
final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.99]
final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.9]
final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.9]
final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.99]
final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.99]
final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.9]
final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.9]
final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.99]
final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.99]
final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.9]
final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.9]
final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.99]
final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99]

Evaluation of Deep Reinforcement Learning Algorithms

In general, the evaluation of deep RL is discussed controversially among researchers since it remains unclear how to benchmark and compare such algorithms properly. Is the return or average return a good performance measure? How big is the impact of hyperparameters vs. general algorithm vs. implementation etc.? See the paper Deep Reinforcement Learning that Matters from Henderson et al. 2017 for a nice overview of these problems.

As part of this notebook however, we will evaluate our algorithm as done by the authors of DQN. The testing is very simple. Let the trained agent play the game $30$ times with an e-greedy policy with a fixed $\epsilon = 0.05$ and report the average high score (return).

  • Load a pre-trained agent, potentially recreate a Q-Network graph.
  • Run the agent for 30 episodes with an evaluation_epsilon = 0.05.
  • Plot or print the results in a decent way.
In [72]:
game = gym.make('CartPole-v0')
# You code comes here
eval_episodes = [50, 100, 250]
fixed_epsilon = 0.05

best_overall_mean_model=""

tmp_mean_current_reward = 0

for current_name in model_name:
    
    tf.reset_default_graph()
    
    saver = tf.train.import_meta_graph("./models/{0}.ckpt.meta".format(current_name)) 
    
    temp_split_string = current_name.split("gamma_",1)[1] 
    #DEBUG print(temp_split_string)
    
    temp_split_string = temp_split_string.split("]",1)[0]
    #DEBUG print(temp_split_string)
    current_gamma = float(temp_split_string)
    

    with tf.Session() as sess:
        Q_network = DQNetwork("Q",game.observation_space.shape[0],hidden_layers,game.action_space.n,current_gamma,learn_rate)        
        
        sess.run(tf.global_variables_initializer())
        
        saver.restore(sess, tf.train.latest_checkpoint('./models/'.format(current_name)))

        #DEBUG print(tf.all_variables())
        #print(saver)
        
        for episode_length in eval_episodes:
            mean_reward_ten_list = []
            mean_reward_hundred_list = []
            current_rewards = []
            mean_reward_ten_buffer = MeanBuffer(10)
            mean_reward_hundred_buffer = MeanBuffer(25)
            
            for epi in range(episode_length):
                state = game.reset()
                state = np.reshape(state, [1, game.observation_space.shape[0]])
                x = 0
                while x in range(0,200):
                    action = choose_egreedy_action(sess, state, Q_network, fixed_epsilon)

                    state_next, reward, terminal, info = game.step(action)

                    if terminal:
                        #print("End of Game!!!!")
                        #print(current_name, "\n",epi,": ",x)
                        mean_reward_ten_buffer.add(x)
                        mean_reward_hundred_buffer.add(x)
                        mean_reward_ten_list.append(sum(mean_reward_ten_buffer.sample())/mean_reward_ten_buffer.length())
                        mean_reward_hundred_list.append(sum(mean_reward_hundred_buffer.sample())/mean_reward_hundred_buffer.length())
                        
                        current_rewards.append(x)
                        break

                    else:
                        #game.render()
                        x+=1
                        state = state_next   

            ### Plotting a Graph
            fig = plt.figure(figsize=(25,12))

            mean_current_rewards = [np.mean(current_rewards) for i in range(len(current_rewards))]








            #DEBUG print(sum(mean_reward_ten_buffer.sample()))



             # Plot return per episode  
            ax1 = fig.add_subplot(1,2,1)
            ax1.plot(current_rewards, color=pallete[0], linewidth=0.5)
            ax1.plot(mean_reward_ten_list, color=pallete[1], linewidth=3)
            ax1.plot(mean_reward_hundred_list, color=pallete[2], linewidth=3)
            ax1.plot(mean_current_rewards, color=pallete[3])
            ax1.legend(['Reward','Mean-10', 'Mean-100', 'Mean'])
            ax1.set_title("{0}-{1}".format(current_name, episode_length), fontsize=10)
            plt.ylabel("Return")
            plt.xlabel("Episode")

            plt.savefig('./output/{0}_{1}-Episodes.png'.format(current_name, episode_length))
            # Show the plot
            plt.tight_layout()
            plt.show()

        if tmp_mean_current_reward < mean_current_rewards[0]:
            tmp_mean_current_reward = mean_current_rewards[0]
            best_overall_mean_model = current_name
        
       
/home/pia4/.direnv/python-3.7.3rc1/lib/python3.7/site-packages/gym/envs/registration.py:17: PkgResourcesDeprecationWarning: Parameters to load are deprecated.  Call .resolve and .require separately.
  result = entry_point.load(False)
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
INFO:tensorflow:Restoring parameters from ./models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt
In [74]:
print("Best Model: ",best_overall_mean_model)
print("Best Reward: ", tmp_mean_current_reward)
Best Model:  final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.9]
Best Reward:  161.52

Best Evaluation

We have gone over all models and checked which Model had the best overall mean value

Random Agent in Comparision

Training

50 Episodes

100 Episodes

250 Episodes

Commenting Training and selected Hyperparameters

Method

For this experiment we chose to train 64 different Models.
These different models were the result of using 6 hyperparameters with 2 different values each

$n_{models} = 2^6 = 64$

The selected Hyperparameters with two differnet values were:

  • final epsilon [ 0.02 | 0.1 ]
  • pre training steps [ 700 | 5000 ]
  • final exploration step [ 2x Multiplier | 4x Multiplier ]
  • buffer size [ 500 | 10000 ]
  • T (Number of total training steps [ 10000 | 35000 ]
  • gamma [ 0.9 | 0.99 ]

For the hyperparameter final exploration step we chose to use a multiplier. We multiplied the current pre training steps either with 2 or with 4. That way we always had a relative epsilon decrease dependant on the number of pretraining steps.

Results

Training

final-epsiolon_0.02_pretrain-steps_700_final-exploration-step_2800_buffersize_10000_T_10000_gamma_0.99]

In this model we have a short Pretraining Phase of 700 steps. The Epsilon Decay is chosen to be a longer version by multipling the pretrain steps by 4.

$final exploration step = 4*pretrain steps = 4*700 = 2800$

The Buffersize has been chosen to be large with 10000 stored samples. At the same time the total Training Steps has been chosen low at also 10000. That means each step is actually stored inside the Replay Buffer. Gamma has been chosen at 0.99

During the first 130 Episodes, the Model does not achieve great results. But then, when Epsilon has reached a low value of around 0.2, the Model suddenly improves and at the end of the 10000 Steps, the model achieves great results. During the last few Episodes, this model achieves very high values. We can also see that the Mean 10 scores and Mean 100 scores constantly rise. But unfortunately, all of the good results happen in the last quarter of the training time.

Let us first take a look at the Evaluation of this model. Afterwards, we can take a look at the model with more Trainingsteps and compare both of them.

50 Episodes

100 Episodes

250 Episodes

In all 3 evaluations we can clearly see Bad results with a low Mean Reward Value of only around 9. This does seem odd. Let's investigate further.

Next we take a look at the Model with more Training Steps: final-epsiolon_0.02_pretrain-steps_700_final-exploration-step_2800_buffersize_10000_T_35000_gamma_0.99]

Here we can see, that this model does have high peaks, but also lower values after having already reached a high peak. Another difference we can see is, that it takes this model longer to start getting better results. We have already reached the final epsilon value but it need some more episodes to achieve better results. Let's take a look at the evaluation

50 Episodes

100 Episodes

250 Episodes

We can see much better results here than in the T=10000 Steps variant. Now we reach a 68 Reward Mean. But we are still not close to the maximum. Let's see what happens if we reduce the buffersize:

final-epsiolon_0.02_pretrain-steps_700_final-exploration-step_2800_buffersize_500_T_35000_gamma_0.99]
50 Episodes

100 Episodes

250 Episodes

As expected a lower Buffersize of the Replay Buffer results in worse results. Lets see what happens if we increase the pretrain steps (we return the Buffersize to 10000)

final-epsiolon_0.02_pretrain-steps_5000_final-exploration-step_20000_buffersize_10000_T_35000_gamma_0.99]
50 Episodes

100 Episodes

250 Episodes

Similar to the first model, we have a good training graph, but it ends too soon. To see better results here, we would probably need to increase the Training steps. But we can also try to decrease the final explorationsteps

final-epsiolon_0.02_pretrain-steps_5000_final-exploration-step_10000_buffersize_10000_T_35000_gamma_0.99]
50 Episodes

100 Episodes

250 Episodes

Apparently decreasing the Final Exploration Step led to worse results. So lets return to our best model yet (lower pretrain steps with 4X multiplication and large buffer size) and increas the final epsilon value

final-epsiolon_0.1_pretrain-steps_700_final-exploration-step_2800_buffersize_10000_T_35000_gamma_0.99]

50 Episodes

100 Episodes

250 Episodes

Well, we have found a better working model. So why is it better to use more random based actions and not rely so much on the trained Network. Perhaps because we haven't trianed it long enough to get good results. Here we should definately try out a longer exploration time. Let's take a look yet again, with longer pretraining

final-epsiolon_0.1_pretrain-steps_5000_final-exploration-step_20000_buffersize_10000_T_35000_gamma_0.99]

50 Episodes

100 Episodes

250 Episodes

Well, not so good. Even though we let the agent explore longer, we got worse results. Of course we should include the fact into our analysis, that the exploitation steps were much less in this model. So perhaps a longer Timestep Training would be wide to test here. Another thing that could be tested would be to keep the pretraining steps quite low, but increase the final exploration steps. That way the decrease of epsilon would go slower and the agent would have more time adapt from random actions to predicted actions. Well, let's return to our best model yet and swap out our last hyperparameter. Gamma:

final-epsiolon_0.1_pretrain-steps_700_final-exploration-step_2800_buffersize_10000_T_35000_gamma_0.9]

50 Episodes

100 Episodes

250 Episodes

Well, what can we learn from all this? So far not very much. It would need more testing, especially with higher training steps. Another hyperparameter we have completele let out of the loop for this experiment is the learning rate of the network. Our best model out of all 64 has reached a good reward mean value of 161. But at the same time, if we take a look at the individual rewards per episode, we can clearly see, that the rewards are jumping around, which indicates, that this model doesn't really know what it's doing. This may come from a low buffersize, or short pretraining and short final exploration step.

For further experiment we definalt advise to increase the Training steps, keep a high buffer size and perhaps increas the final exploration step and play around with some other hyperparameters such as the learning rate and of course use a different network architecture for the Neural network.

Further Ideas

  • How fast can you train the agent to a test score > 195? In other words, what is the smallest amount of training steps you need to achieve this goal?
  • Can you get a mean 100 score > 199?

    We suppose it is possible, but at the same time hard to achieve. We would need to train a perfect agent. Epsilon Final should be at around 0, because we cannot rely on a random action to be good. It might ruin the perfect mean 100 score. I suppose it would need a better neural network architechture to predict better actions and a much longer training time. Another thing to try would be to try out different epsilon schedules. Possible examples could be:

$argmax(0,((T-(t/T))*\alpha)*sin(\beta*t)))$

or

$(T-(t/T))*\alpha*|sin(\beta*t)|$

$\alpha$ : Hyperparameter to modify the maximum Amplitude
$\beta$ : Hyperparameter to modify the stretching factor in X-Direction (modify frequency)
T : Maximum training steps
t : current step

with $\alpha$ being a stretching aplitude hyperparameter we can modify the maximum aplitude. $\alpha$ could also be connected to the total Time steps so that it gradually decreases over time. We could also stretch the sinus curve in the frequency direction with the $\beta$ variable. With such an implementation, we would not simply stop exploring after reaching the final epsilon value for the first time, but instead start exploring again.

  • As always, experiment with Hyperparameters/Network sizes etc. and reason about their effects/importance!
  • Implement and experiment with new/different exploration schemes.
  • Extend the algorithm to play Atari games.

    Because time ran out, we weren't able to implement all of our ideas, or try the Atari Implementation with Deep Reinforcement Learning.

Playing Atari with Deep Reinforcement Learning

So far you have implemented a basic DQN agent. For simplicity we have left out some important details which are crucial in order to play video games. If you are eager to do this anyway here are the missing parts.

Architecture

First of all, switch the simple MLP with the following architecture from the paper. Note that there are no pooling layers in this CNN!

DQN

Observability and Preprocessing

We have not talked about observability so far. Formally, Atari video games are Partially Observable Markov Decision Processes or POMDPs. This means that the game screen is not a sufficient observation to fully describe the underlying state and that the Markov Assumption does not hold. A simple example to makes this clear. Think about the game Pong. Given only one frame, the agent has no way of telling if the ball is currently moving from left to right or from right to left. For that reason the authors used the last 4 frames of the game as observation. This turns the POMDP into an MDP again. Furthermore they applied some more preprocessing steps to the game screens such as turning them into gray scale, rescaling, and taking the max out of two subsequent frames. Please see the methods section of the DQN paper for more details on that.

Hint: In order to implement this you have to keep some sort of frame buffer of the last 4 frames etc.!

Training Details

In order to play faster, the authors trained the network only every $K=4$ time step. In between, the last taken action was repeated. This allows to play more games, e.g. gather more experience in less time since stepping the emulator forward is computationally cheaper than training the network. Again see the paper for details.

During training, the authors clipped the reward to the range $R \in \{-1,0,1\}$. Remember to remove this constraint during testing again to get the real high score

Another way to improve training stability was to clip the gradients (or better the squared L2 loss) to the range -1,1. In other words, only apply L2 if the error is inside this range and take a linear loss outside. This corresponds to a Huber Loss. Here you can find a TensorFlow implementations of that from the OpenAI baseline agents. Please read the Wikipedia to see what the function is doing.

The original DQN implementation used a slightly modified version of RMSProp as its optimizer. You dont have to implement this. It is perfectly fine to stay with Adam for instance. However, be aware that the learning rate is a really crucial parameter in this context. If at least for any, this is definetly the first (and probably the most important) hyperparameter for which you want to test different settings!

Hyper Parameters

See the paper for a list of good default parameters. Due to the lengthy training times you may want to reduce the total amount of time steps the agent will be trained. You also may adjust the exploration accordingly. However, be aware that exploration time is very important. You may want to benchmark a very short training first and then do some rough calculations of how long it will take to train the agent for some $x$ time steps etc. Then plan some experiments.

In [ ]:
 
In [31]:
!tar chvfz notebook.tar.gz *
PIA_RL_DQN.ipynb
PIA_RL_DynamicProgramming.ipynb
PIA_RL_Intro.ipynb
PIA_RL_QLearning.ipynb
Pics/
Pics/cbowGramArchitecture.png
Pics/semanticRelations.PNG
Pics/lexicalContrastInjection.PNG
Pics/overAllPicture.png
Pics/semanticRelatednessVis.png
Pics/CRISPsmall.png
Pics/dsm.png
Pics/bowVsEmbedding.png
Pics/bilingualEmbedding.PNG
Pics/skipGramTrainSamples.png
Pics/wordMappings.png
Pics/feedfile.png
Pics/collobertSimilarities.PNG
Pics/mlCategories.png
Pics/dataTypes.png
Pics/skipGramArchitecture.png
Pics/CBowTrainSamples.png
Pics/crispIndallnodep.svg
Pics/crispIndallnodep.png
RL_pics/
RL_pics/AgentEnvLoop.png
RL_pics/DQN_principle.png
RL_pics/Atari_games.png
RL_pics/DQN_architecture.png
RL_pics/RL_taxonomy.png
img/
img/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.9].png
img/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.9].png
img/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.9].png
img/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.99].png
img/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.9].png
img/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.9].png
img/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.9].png
img/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.99].png
img/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.99].png
img/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.99].png
img/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.99].png
img/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.9].png
img/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.9].png
img/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.9].png
img/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.9].png
img/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.99].png
img/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].png
img/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.9].png
img/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.9].png
img/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.99].png
img/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.9].png
img/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.9].png
img/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.99].png
img/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.9].png
img/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.9].png
img/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.99].png
img/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.9].png
img/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.9].png
img/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.9].png
img/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.99].png
img/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.99].png
img/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.99].png
img/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.9].png
img/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.99].png
img/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.99].png
img/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.9].png
img/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.99].png
img/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.99].png
img/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.9].png
img/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.99].png
img/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.9].png
img/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.99].png
img/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.99].png
img/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.9].png
img/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.99].png
img/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.99].png
img/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.99].png
img/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.99].png
img/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.9].png
img/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.99].png
img/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.9].png
img/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.9].png
img/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.99].png
img/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.9].png
img/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].png
img/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.99].png
img/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.99].png
img/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.9].png
img/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.9].png
img/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.99].png
img/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.99].png
img/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.99].png
img/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.9].png
img/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.9].png
models/
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.99].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.9].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.99].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.9].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.9].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.99].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.9].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.99].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.9].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.9].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.99].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.9].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.9].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.9].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.9].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.9].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.99].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.99].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.99].ckpt.data-00000-of-00001
models/.ipynb_checkpoints/
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.9].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.99].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.99].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.9].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.99].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.99].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.99].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.9].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.9].ckpt.data-00000-of-00001
models/checkpoint
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.9].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.9].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.9].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.99].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.9].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.9].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.9].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.99].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.9].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.99].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.99].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.9].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.99].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.9].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.9].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.9].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.99].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.9].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.99].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.9].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.99].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.99].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.99].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.99].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.99].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.9].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.99].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.99].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.9].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.99].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.9].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.99].ckpt.data-00000-of-00001
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.9].ckpt.data-00000-of-00001
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.99].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.9].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.9].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.99].ckpt.index
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.9].ckpt.meta
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.99].ckpt.meta
models/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.99].ckpt.index
models/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.99].ckpt.meta
output/
output/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.9].png
output/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.9].png
output/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.9].png
output/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.99].png
output/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.9].png
output/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.9].png
output/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.9].png
output/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.99].png
output/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.99].png
output/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.99].png
output/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.99].png
output/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.9].png
output/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.9].png
output/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.9].png
output/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.9].png
output/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.99].png
output/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].png
output/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.9].png
output/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.9].png
output/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.99].png
output/.ipynb_checkpoints/
output/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.9].png
output/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.9].png
output/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.99].png
output/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.9].png
output/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.9].png
output/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.99].png
output/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.9].png
output/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.9].png
output/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.9].png
output/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.99].png
output/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.99].png
output/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.99].png
output/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.9].png
output/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.99].png
output/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.99].png
output/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_35000|gamma_0.9].png
output/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.99].png
output/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_10000|gamma_0.99].png
output/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.9].png
output/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.99].png
output/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.9].png
output/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.99].png
output/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.99].png
output/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_35000|gamma_0.9].png
output/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_500|T_10000|gamma_0.99].png
output/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.99].png
output/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.99].png
output/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.99].png
output/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_35000|gamma_0.9].png
output/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.99].png
output/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_35000|gamma_0.9].png
output/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_500|T_10000|gamma_0.9].png
output/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.99].png
output/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_500|T_10000|gamma_0.9].png
output/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_35000|gamma_0.99].png
output/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.99].png
output/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_1400|buffersize_10000|T_10000|gamma_0.99].png
output/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_20000|buffersize_10000|T_10000|gamma_0.9].png
output/final-epsiolon_0.02|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_35000|gamma_0.9].png
output/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.99].png
output/final-epsiolon_0.1|pretrain-steps_700|final-exploration-step_1400|buffersize_500|T_35000|gamma_0.99].png
output/final-epsiolon_0.1|pretrain-steps_5000|final-exploration-step_10000|buffersize_10000|T_10000|gamma_0.99].png
output/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_10000|gamma_0.9].png
output/final-epsiolon_0.02|pretrain-steps_700|final-exploration-step_2800|buffersize_10000|T_35000|gamma_0.9].png